A quick analysis of Baltimore crime

I’m going to do a very simple analysis of Baltimore crime to show off R. We’ll use data downloaded from Baltimore City’s awesome open data site (this was downloaded a couple of years ago so if you download now, you will get different results).

Getting data

Let’s load the data:

arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
cctv_tab=read.csv("CCTV_Locations.csv", stringsAsFactors=FALSE)

# these columns are mislabeled, so fix them
tmp=arrest_tab$sex
arrest_tab$sex=arrest_tab$race
arrest_tab$race=tmp

Exploring data

# dimension of table (data.frame)
dim(arrest_tab)
## [1] 104528     15
# what are the columns
names(arrest_tab)
##  [1] "arrest"            "age"               "sex"              
##  [4] "race"              "arrestDate"        "arrestTime"       
##  [7] "arrestLocation"    "incidentOffense"   "incidentLocation" 
## [10] "charge"            "chargeDescription" "district"         
## [13] "post"              "neighborhood"      "Location.1"
# what is the average arrest age?
mean(arrest_tab$age)
## [1] 33.19639
# the range of arrest ages
range(arrest_tab$age)
## [1]  0 87
# how many arrests per sex
table(arrest_tab$sex)
## 
##           F     M 
##     2 19431 85095
# what are the most common offenses
head(sort(table(arrest_tab$incidentOffense),decreasing=TRUE))
## 
##         Unknown Offense            87-Narcotics       4E-Common Assault 
##                   38649                   24744                    6739 
## 87O-Narcotics (Outside)     97-Search & Seizure                79-Other 
##                    6515                    3670                    3461
# what are the offenses that only happen once
tab <- table(arrest_tab$incidentOffense)
tab[tab == 1]
## 
##            13-Assist Officer          2C-Carnal Knowledge 
##                            1                            1 
## 3EF-Robb Gas Station-Firearm   3EK-Robb Gas Station-Knife 
##                            1                            1 
##       62-Person Lying On St.            64-Drug Free Zone 
##                            1                            1 
##      6K-Larceny- Park. Meter      6L-Larceny- From Locker 
##                            1                            1 
##      70-Sanitation Complaint     8FO-Arson Public Str-Occ 
##                            1                            1 
##        8I-Arson Other Mobile       94-Abduction By Parent 
##                            1                            1
# range of ages after removing those w/ age==0
range(arrest_tab$age[arrest_tab$age>0])
## [1]  8 87

Offenses by sex

tab <- table(arrest_tab$incidentOffense, arrest_tab$sex)

Let’s see a table of arrests by sex and race

table(sex=arrest_tab$sex,race=arrest_tab$race)
##    race
## sex           A     B     H     I     U     W
##         2     0     0     0     0     0     0
##   F     0    37 14663     0    34   183  4514
##   M     0   205 72605     1   184  1566 10534

A histogram of age

hist(arrest_tab$age,nc=100)

with(arrest_tab,hist(age[sex=="M"],nc=100)) # males only

with(arrest_tab,hist(age[sex=="F"],nc=100)) # females only

Are males and females arrested at different ages on average?

Let’s take a look at how age depends on sex. Let’s plot age as a function of sex first (notice how we indicate that sex is a factor).

plot(arrest_tab$age~factor(arrest_tab$sex))

One of the neat things about R is that statistical model building and testing is built in. The model we use is \(y_i=\beta_0+\beta_1 x_i\), where \(y_i\) is the age of record (example) \(i\) and \(x_i \in \{0,1\}\) is an indicator variable with \(x_i=1\) if the \(i\)-th record is male. You can check that \(\beta_1\) is the difference in mean age between males and females (the male mean minus the female mean, since female is the baseline level). We use the formula syntax to build a linear regression model.

# let's ignore those records with missing sex
fit=lm(age~factor(sex),data=arrest_tab,subset=arrest_tab$sex %in% c("M","F"))
summary(fit)
## 
## Call:
## lm(formula = age ~ factor(sex), data = arrest_tab, subset = arrest_tab$sex %in% 
##     c("M", "F"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.388 -10.153  -3.153   9.612  53.847 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.38778    0.08472 394.117   <2e-16 ***
## factor(sex)M -0.23432    0.09389  -2.496   0.0126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.81 on 104524 degrees of freedom
## Multiple R-squared:  5.958e-05,  Adjusted R-squared:  5.002e-05 
## F-statistic: 6.228 on 1 and 104524 DF,  p-value: 0.01257

We see that \(\beta_1 \approx -0.23\), meaning that the average arrest age for males is just under three months younger. So there is very little difference in average age (which is what the linear model is testing), but the probability of observing a difference this large in a sample of this size, when there is no true difference in average age, is small (\(p \approx 0.01\)). Since we have a very large number of records, this testing framework will declare even very small differences statistically significant. We’ll return to this theme later in class.
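As a sanity check on the claim that \(\beta_1\) equals the difference of group means, here is a minimal self-contained sketch with simulated data (the names `g` and `y` and the parameter values are made up):

```r
# Simulated data: two groups with slightly different means
set.seed(1)
g <- rep(c("F", "M"), each = 50)
y <- rnorm(100, mean = ifelse(g == "M", 33.2, 33.4), sd = 12)

# The slope of the indicator regression ...
fit <- lm(y ~ factor(g))
unname(coef(fit)["factor(g)M"])
# ... equals the difference of group means (M minus F)
mean(y[g == "M"]) - mean(y[g == "F"])
```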

Geographic distribution of arrests.

First we need to extract latitude and longitude from the location column. We’ll use some string functions to do this.

# the column is named "Location.1"; $Location only works via R's partial matching
tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location.1))
tmp=strsplit(tmp,split=",")
arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
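To see what these string functions are doing, here is the same pipeline applied to a single made-up location string (`loc` is hypothetical):

```r
loc <- "(39.2904, -76.6122)"                      # hypothetical location value
tmp <- gsub("\\)", "", gsub("\\(", "", loc))      # strip the parentheses
parts <- as.numeric(strsplit(tmp, split = ",")[[1]])
parts[1]  # latitude  (first field)
parts[2]  # longitude (second field)
```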

Now let’s plot

plot(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")

We can also use density estimates to make this nicer:

smoothScatter(arrest_tab$lon, arrest_tab$lat, xlab="Longitude", ylab="Latitude", main="Arrests in Baltimore")

Let’s make this fancier using the ggplot2 graphics systems and the maps package containing map data.

library(maps)
library(ggplot2)

balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).

Now let’s add CCTV cameras.

tmp=gsub("\\)","",gsub("\\(","",cctv_tab$Location))
tmp=strsplit(tmp,split=",")
cctv_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
cctv_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))

plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=arrest_tab,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=cctv_tab,aes(x=lon,y=lat),color="red")
print(plt)
## Warning: Removed 40636 rows containing missing values (geom_point).

A challenge

Is there any relationship between the number of CCTV cameras and the number of arrests? Divide the city into a grid and plot the number of CCTV cameras vs. the number of arrests.

# step 1: divide city intro grid for arrest data
# step 1a: find the range of latitude and longitude
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)

# step 1b: discretize latitude into 50 bins
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)

# now longitude
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)

# step 1c: make a factor indicating geographic grid location
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))

# step 2: do the same for the cctv data
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))

arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
m=match(names(cctvTab),names(arrestTab))
plot(arrestTab[m]~factor(cctvTab),xlab="Number of CCTV cameras", ylab="Number of Arrests")
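To put a number on the relationship shown in the plot, one option is a rank correlation between camera counts and arrest counts in the matched grid cells. This is a sketch that reuses the `arrestTab`, `cctvTab`, and `m` objects built above (camera cells with no matching arrest cell are dropped):

```r
ok <- !is.na(m)  # keep only camera cells that also appear in the arrest grid
cor.test(as.numeric(cctvTab[ok]), as.numeric(arrestTab[m[ok]]),
         method = "spearman")
```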

Extra analyses

As part of Project 1 you will add to this analysis. Please use the following template:

Mihai Sirbu

What question are you asking?:

I am trying to answer: at what time are most people arrested? For this preliminary analysis, I plan on making a plot where hour is the x-axis and the number of arrests is the y-axis. This will produce an “Arrest Timeseries”.

What is the code you use to answer it?:

time <- strptime(arrest_tab$arrestTime, "%H:%M")
arrest_tab$hours <- as.numeric(format(time, "%H"))

hours_df <- as.data.frame(table(arrest_tab$hours))
names(hours_df) <- c("hour","count")

g <- ggplot(hours_df, aes(hour, count, group=1))+geom_line(color="blue")+geom_point(color="blue")
g <- g+labs(title = "Arrest Timeseries", x="Time of Day",y="Num of Arrests")
g <- g+scale_x_discrete(breaks=seq(0,23,2))
g <- g + theme(plot.title=element_text(size=16,face="bold"),axis.title.x=element_text(size=16,face="bold"),axis.title.y=element_text(size=16,face="bold"))
g

What did you observe?

I had originally thought that there would be very few arrests until 8 pm, at which point there would be a giant spike from 8 pm to 5 am. But that was not the case. Instead, the two biggest arrest hours were 6 pm, followed by 10 am (!!). At this point, I’m not entirely sure why that might be. I would be surprised, however, if all offenses followed this exact same pattern.

Aaron Dugatkin

What question are you asking?: I am trying to find out how cameras affect the kinds of crimes committed in their area, whether by deterring certain types of crime or by leading to more arrests for other types.

What is the code you use to answer it?:

# modified code from above, to create factors, but remove NA

# added by HCB to restore original arrest table
arrest_tab_original = arrest_tab
#

arrest_tab = arrest_tab[!is.na(arrest_tab$lat) & !is.na(arrest_tab$lon),]
latRange=range(arrest_tab$lat,na.rm=TRUE)
lonRange=range(arrest_tab$lon,na.rm=TRUE)
latGrid=seq(min(latRange),max(latRange),len=50)
latFactor=cut(arrest_tab$lat,breaks=latGrid)
lonGrid=seq(min(lonRange),max(lonRange),len=50)
lonFactor=cut(arrest_tab$lon,breaks=lonGrid)
gridFactor=factor(paste(lonFactor,latFactor,sep=":"))
latFactor=cut(cctv_tab$lat,breaks=latGrid)
lonFactor=cut(cctv_tab$lon,breaks=lonGrid)
cctvGridFactor=factor(paste(lonFactor,latFactor,sep=":"))
arrestTab=table(gridFactor)
cctvTab=table(cctvGridFactor)
#count crimes in areas with and without camera
arrestOnCamera = gridFactor %in% names(cctvTab)
count_crime_tab <- table(arrest_tab$incidentOffense, arrestOnCamera)
#merge the two tables, and calculate the difference in crime frequency in the two situations
crime_tab <- data.frame(count_crime_tab[,1], count_crime_tab[,2])
colnames(crime_tab)[1] <- "noCamCrimes"
colnames(crime_tab)[2] <- "camCrimes"
crime_tab$names <- rownames(crime_tab)
crime_tab$campct <- crime_tab$camCrimes/sum(crime_tab$camCrimes)*100
crime_tab$nocampct <- crime_tab$noCamCrimes/sum(crime_tab$noCamCrimes)*100
crime_tab$pctchange <- crime_tab$campct - crime_tab$nocampct
#display the change in crime frequency with crime name in descending order, with the most increased (caught) crimes first
crime_tab <- crime_tab[with(crime_tab, order(-pctchange)), ]
options(scipen=999)
subset(crime_tab, select=c("pctchange"))
##                                       pctchange
## 87-Narcotics                       8.7959299656
## 87O-Narcotics (Outside)            2.1085712279
## 55-Disorderly Person               0.4899152469
## 109-Loitering                      0.1601595042
## 79-Other                           0.1421819771
## 54-Armed Person                    0.0990559521
## 2A-Rape (Force)                    0.0638557474
## 6B-Larceny- Purse Snatch           0.0539250917
## 108-Liquor Law/Open Container      0.0514977916
## 117-Fto                            0.0498783769
## 1A-Murder                          0.0484986356
## 2B-Rape (Attempt)                  0.0480375073
## 3D-Robb Comm. (Ua)                 0.0476871064
## 61-Person Wanted On War            0.0430503173
## 7C-Stolen Veh./Other               0.0404124933
## 2F-Placing Hands                   0.0398369938
## 2D-Statutory Rape                  0.0392614943
## 3AO-Robb Hwy-Other Wpn             0.0381032079
## 88-Unfounded Call                  0.0316364803
## 78-Gambling                        0.0310500497
## 116-Public Urination / Defecation  0.0301387241
## 49-Family Disturbance              0.0279328789
## 5C-Burg. Res. (Noforce)            0.0253984949
## 58-Injured Person                  0.0226317250
## 3CK-Robb Comm-Knife                0.0165117545
## 60-Sick Person                     0.0163973833
## 8AO-Arson Sin Res Str-Occ          0.0148959835
## 80-Lost Property                   0.0130478265
## 20H-Traffic Control                0.0122399410
## 2H-Indecent Exp.                   0.0114284118
## 48-Involuntary Detention           0.0094658836
## 77-Dog Bite                        0.0087760130
## 73-False Pretense                  0.0083112410
## 106-Custody Dispute                0.0072746132
## 113-Littering                      0.0069278560
## 8J-Arson Other                     0.0061199705
## 76-Child Abuse-Sexual              0.0046185707
## 2C-Carnal Knowledge                0.0038106852
## 3M-Robb Bank (Ua)                  0.0038106852
## 64-Drug Free Zone                  0.0038106852
## 6K-Larceny- Park. Meter            0.0038106852
## 6A-Larceny- Pickpocket             0.0034639280
## 93-Abduction - Other               0.0034639280
## 3B-Robb Highway (Ua)               0.0024054381
## 20J-                               0.0011546427
## 84-Bomb Scare                      0.0011546427
## 8BO-Arson Oth Res Str-Occ          0.0011546427
## 104-Malicious Burning              0.0008078855
## 2G-Sodomy/Perverson               -0.0013870287
## 59-Intoxicated Person             -0.0015013998
## 71-Sex Offender Registry          -0.0015013998
## 3P-Robb Misc. (Ua)                -0.0017337858
## 3AJK-Robb Carjack-Knife           -0.0018481570
## 13-Assist Officer                 -0.0026560425
## 3F-Robb Gas Sta. (Ua)             -0.0026560425
## 6L-Larceny- From Locker           -0.0026560425
## 8I-Arson Other Mobile             -0.0026560425
## 6H-Larceny- From Machine          -0.0030027997
## 3GK-Robb Conv Store-Knife         -0.0041574423
## 8H-Arson Motor Veh                -0.0041574423
## 6F-Larceny- Bicycle               -0.0043898283
## 3GF-Robb Conv Store-Firearm       -0.0045041995
## 112-Traffic Related Incident      -0.0048546003
## 85-Mental Case                    -0.0051977138
## 102-Questional Death              -0.0053120850
## 110-Summons Served                -0.0053120850
## 28-Suicide - Attempt              -0.0053120850
## 3NF-Robb Misc-Firearm             -0.0063523565
## 3NO-Robb Misc-Other Wpn           -0.0068134848
## 3JF-Robb Residence-Firearm        -0.0070458708
## 56-Missing Person                 -0.0073926280
## 3GO-Robb Conv Store-Other Wpn     -0.0079681275
## 3LF-Robb Bank-Firearm             -0.0079681275
## 3LO-Robb Bank-Other Wpn           -0.0079681275
## 70A-Ill. Dumping                  -0.0083148847
## 81-Recovered Property             -0.0097055570
## 67-Child Abuse-Physical           -0.0098162845
## 3H-Robb Conv. Stor.(Ua)           -0.0121255698
## 52A-Animal Cruelty                -0.0121255698
## 103-Dead On Arrival               -0.0124723270
## 2J-Other Sex Offn.                -0.0132802125
## 3CO-Robb Comm-Other Wpn           -0.0136269697
## 3JO-Robb Residence-Other Wpn      -0.0136269697
## 114-Hindering                     -0.0152463843
## 3BJ-Robb Carjack(Ua)              -0.0158218838
## 39-Fire                           -0.0174376548
## 83-Discharging Firearm            -0.0174376548
## 3CF-Robb Comm-Firearm             -0.0191714407
## 3NK-Robb Misc-Knife               -0.0215950971
## 96-Stop & Frisk                   -0.0215950971
## 3AJF-Robb Carjack-Firearm         -0.0236756401
## 29-Driving While Intox.           -0.0245978968
## 5E-Burg. Oth. (Att.)              -0.0257525395
## 3AK-Robb Hwy-Knife                -0.0314150253
## 6E-Larceny- Auto Acc              -0.0348789533
## 98-Child Neglect                  -0.0400730235
## 5F-Burg. Oth. (Noforce)           -0.0405341518
## 3JK-Robb Residence-Knife          -0.0454994796
## 3AF-Robb Hwy-Firearm              -0.0496678530
## 23-Unauthorized Use               -0.0497749368
## 95-Exparte                        -0.0556625213
## 4F-Assault By Threat              -0.0558985510
## 26-Recovered Vehicle              -0.0613322945
## 6D-Larceny- From Auto             -0.0768110648
## 5B-Burg. Res. (Att.)              -0.0876530461
## 20A-Followup                      -0.0903127323
## 111-Protective Order              -0.0968938311
## 6J-Larceny- Other                 -0.1005974324
## 3K-Robb Res. (Ua)                 -0.1019735301
## 5D-Burg. Oth. (Force)             -0.1114467011
## 6G-Larceny- From Bldg.            -0.1242657852
## 7A-Stolen Auto                    -0.1317800717
## 115-Trespassing                   -0.1403564919
## 4A-Agg. Asslt.- Gun               -0.2195365579
## 75-Destruct. Of Property          -0.2844580826
## 4D-Agg. Asslt.- Hand              -0.3185182194
## 6C-Larceny- Shoplifting           -0.4077418953
## 4B-Agg. Asslt.- Cut               -0.4938461736
## 5A-Burg. Res. (Force)             -0.5879404416
## 4C-Agg. Asslt.- Oth.              -0.6379805575
## 24-Towed Vehicle                  -1.0684576178
## 97-Search & Seizure               -1.1822536848
## 55A-Prostitution                  -1.1994275774
## Unknown Offense                   -1.5811234143
## 4E-Common Assault                 -2.7616680093
# added by HCB to restore original arrest table
arrest_tab = arrest_tab_original

What did you observe? The results were interesting. We see a large increase in narcotics charges, which may be due to camera surveillance. We also see a decrease in assault, which may be due to perpetrators realizing the danger of committing such crimes in front of a camera. However, the vast majority of crimes do not even see a 1% change between the two situations, so it would appear that, overall, cameras do not have a major effect on criminal activity.

Anna Petrone

What question are you asking?: Which neighborhoods in Baltimore have the highest number of arrests?

What is the code you use to answer it?:

Load libraries

library(rgdal) # needed for reading shape files
## Loading required package: sp
## rgdal: version: 0.9-1, (SVN revision 518)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.10.1, released 2013/08/26
## Path to GDAL shared files: /usr/share/gdal/1.10
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
library(plyr) # needed for rename function
library(sp) # needed for point.in.polygon function 
library(ggmap) # could use for geocoding addresses
library(ggplot2) # needed for plotting

Find number of arrests for which the geo coordinates weren’t given

no.geo.idx = nchar(arrest_tab$Location.1) == 0
n.geo.missing = sum( no.geo.idx )
narrests = dim(arrest_tab)[1]
n.geo.missing/narrests*100 # 39%
## [1] 38.87571

Find the number of incidents that don’t have geocode info but do have an incidentLocation

has.location = nchar(arrest_tab$incidentLocation) > 0 
sum(no.geo.idx & has.location)
## [1] 7650
#tmp = paste(arrest_tab$incidentLocation[no.geo.idx & has.location], "Baltimore, MD")
#gc = geocode(tmp[1:2490]) # restricted to 2500 api requests per day

Get the 2010 statistical community areas from Baltimore’s open data site. Download the shape files and extract them from the .zip file.

setwd("csa_2010_boundaries/")
csa = readOGR(dsn=".",layer="CSA_NSA_Tracts")
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "CSA_NSA_Tracts"
## with 56 features and 3 fields
## Feature type: wkbPolygon with 2 dimensions
csa.df = fortify(csa) # fortify turns the shape data into a data.frame
## Regions defined for each Polygons
csa.df = rename(csa.df, c("long"="X.feet","lat"="Y.feet")) # MD uses State Plane coords instead of lat/lon (see comments section)

convert = FALSE
if (convert){ # write a file to send to matlab code (described in comments section)
  write.csv(csa.df[,c("Y.feet","X.feet")], "csa-df.txt",  quote=FALSE, na="",row.names=FALSE) # note: the columns were renamed above, so "lat"/"lon" don't exist yet; (Y, X) input order assumed
}
csa.converted.df = read.csv("csa-df_converted.txt",header=FALSE) # output of the matlab code, converted to lat/lon
setwd("..")

csa.converted.df = rename(csa.converted.df, c("V1"="lat","V2"="lon"))
csa.df = cbind(csa.df, csa.converted.df)

Now assign each of the arrest records to a neighborhood. This is only possible for the records that have geo info. This step takes about 15-20 seconds.

ncsa = dim(csa)[1]
arrest_tab_geo = arrest_tab[!no.geo.idx,]
narrests.geo = dim(arrest_tab_geo)[1]

arrest_nbhd_id = vector(length = narrests.geo)

for (j in 1:ncsa) { # takes about 30 sec
  idx = csa.df$id == j-1
  polyx = csa.df$lon[idx]
  polyy = csa.df$lat[idx]
  
  in.poly = point.in.polygon(arrest_tab_geo$lon, arrest_tab_geo$lat, polyx,polyy)
  in.poly= as.logical(in.poly)
  arrest_nbhd_id[in.poly] = j - 1

}

arrest_tab_geo = cbind(arrest_tab_geo, arrest_nbhd_id)
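The workhorse in the loop above is `sp::point.in.polygon`. A tiny self-contained illustration of its return codes, using a made-up unit-square polygon and two test points:

```r
library(sp)
# unit square; first test point is inside, second is outside
point.in.polygon(point.x = c(0.5, 2), point.y = c(0.5, 2),
                 pol.x = c(0, 1, 1, 0), pol.y = c(0, 0, 1, 1))
# return codes: 0 = outside, 1 = inside, 2 = on an edge, 3 = on a vertex
```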

For each neighborhood, count the number of arrests, using the table function

nbhd.narrests = as.data.frame(table(arrest_nbhd_id))
nbhd.narrests = rename(nbhd.narrests, c("arrest_nbhd_id"="id", "Freq"="narrests"))
nbhd.names= as.vector(csa$Neigh)
nbhd.narrests = cbind(nbhd.names, nbhd.narrests)
head(nbhd.narrests)
##                                                                                                                nbhd.names
## 1 Allendale, Carroll-South Hilton, Gwynns Falls, Irvington, Saint Josephs, Uplands, Yale Heights, Lower Edmondson Village
## 2                                                     Beechfield, Hunting Ridge, Ten Hills, Tremont, West Hills, Westgate
## 3                                                   Belair-Edison, Clifton Park, Four By Four, Mayfield, Herring Run Park
## 4                                         Brooklyn, Curtis Bay, Fairfield Area, Hawkins Point, Curtis Bay Industrial Area
## 5                                                                                                  Canton, Patterson Park
## 6                                                                                   Cedonia, Frankford, Cedmont, Parkside
##   id narrests
## 1  0     1358
## 2  1      346
## 3  2     1752
## 4  3     2155
## 5  4      165
## 6  5     1330

Merge the arrest counts with the geometry data

csa.df = merge(csa.df, nbhd.narrests, by="id",all.x=TRUE)

Make a plot colored by number of arrests

g = ggplot(csa.df,aes(x=lon,y=lat,group=group))   
g = g + geom_polygon(aes(fill=narrests)) + scale_fill_gradient(low="slategray1",high="slateblue4") # color the nbhds by narrests
g = g + geom_path(colour="gray75",size=.1) # draw lines separating the neighborhoods
g = g + ggtitle("Baltimore City Arrests 2011 - 2012") # add a title
g = g + theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), 
              axis.title.x = element_blank(), axis.title.y = element_blank()) # remove axis labels
print(g)

What did you observe? First, it should be noted that out of the 104,528 arrest records, 40,636 of them (about 39%) did not have geocoded locations (latitude and longitude). Some of them (7,650) did have an address in the incidentLocation field, so it would be possible to geocode these, though it would not contribute enormously. (I did take a look at the ggmap library, which can convert an address string into lat and lon by calling the Google Maps geocoding API; however, it only allows 2,500 requests per day.) Therefore my analysis only includes the 61% of records that provided geocoded information.

Second, I need to note that it was a pain converting from the MD State Plane coordinates into longitude and latitude. I ended up using an external MATLAB function to do the conversion, since it seemed really confusing to do in R.
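For what it’s worth, an all-R alternative may be possible with `rgdal::spTransform` if the projection is identified by its EPSG code; Maryland’s NAD83 State Plane (US survey feet) should be EPSG:2248. This is an unverified sketch with made-up coordinates:

```r
library(sp)
library(rgdal)
# made-up State Plane (ft) coordinates somewhere in Baltimore
pts <- SpatialPoints(cbind(x = 1407000, y = 590000),
                     proj4string = CRS("+init=epsg:2248"))
spTransform(pts, CRS("+proj=longlat +datum=WGS84"))  # reproject to lon/lat
```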

The results: As could probably be expected, the highest number of arrests occurred in the downtown area, with the northwest area being notably high as well. The Inner Harbor neighborhood is among the lowest, which makes sense as this area is more touristy. The central northern neighborhoods are also on the low end (I don’t know Baltimore, but I’m guessing these are higher-income neighborhoods).

For future analysis, it would be good to create a similar plot where the colors represent neighborhood income level. I would also like to add a layer showing the locations of transit stations, since these are commonly believed to attract crime.

Raul Alfaro

What question are you asking?: What is the most common crime for each race?

What is the code you use to answer it?:

head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="A"]),decreasing=TRUE))
## 
##         Unknown Offense            87-Narcotics       4E-Common Assault 
##                      98                      42                      22 
## 87O-Narcotics (Outside)        24-Towed Vehicle 61-Person Wanted On War 
##                      13                      10                       6
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="B"]),decreasing=TRUE))
## 
##         Unknown Offense            87-Narcotics 87O-Narcotics (Outside) 
##                   31500                   21783                    5706 
##       4E-Common Assault     97-Search & Seizure                79-Other 
##                    5500                    3255                    2879
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="H"]),decreasing=TRUE))
## 87-Narcotics 
##            1
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="I"]),decreasing=TRUE))
## 
##      Unknown Offense         87-Narcotics    4E-Common Assault 
##                   92                   44                   14 
##  97-Search & Seizure     24-Towed Vehicle 4C-Agg. Asslt.- Oth. 
##                    8                    6                    5
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="U"]),decreasing=TRUE))
## 
##         Unknown Offense       4E-Common Assault            87-Narcotics 
##                     893                     179                     171 
## 87O-Narcotics (Outside)     4B-Agg. Asslt.- Cut                79-Other 
##                      51                      48                      46
head(sort(table(arrest_tab$incidentOffense[arrest_tab$race=="W"]),decreasing=TRUE))
## 
##         Unknown Offense            87-Narcotics       4E-Common Assault 
##                    6065                    2702                    1024 
## 87O-Narcotics (Outside)        55A-Prostitution                79-Other 
##                     740                     671                     528

What did you observe? I observed that, aside from the Unknown Offenses, the most common crime for every race but one was narcotics; the exception was the “U” race category, whose most common offense was common assault.

Kim St. Andrie, Rain Surasorn

What question are you asking?: Which year had the largest number of arrests?

What is the code you use to answer it?:

a <- data.frame(id = arrest_tab$arrest, year = substr(arrest_tab$arrestDate,7,11))
head(sort(table(a$year), decreasing=TRUE),10)
## 
##  2011  2012 
## 52868 51660

What did you observe? 2011 had the largest number of arrests, but there was no dramatic difference between the two years: there were 52,868 arrests in 2011, which was 1,208 more than in 2012.

Rentao Wu

What question are you asking?:

I wanted to know if the ratio of female to male arrest counts is similar across the different races.

What is the code you used to answer it?

mytab = table(race=arrest_tab$race, sex=arrest_tab$sex)
#mydf$V1 = NULL
mydf = as.data.frame.matrix(mytab) 
mydf$ratio = mydf$F/mydf$M
mydf$ratio <- round(mydf$ratio, 3)
mydf = mydf[-1,]
mydf
##   V1     F     M ratio
## A  0    37   205 0.180
## B  0 14663 72605 0.202
## H  0     0     1 0.000
## I  0    34   184 0.185
## U  0   183  1566 0.117
## W  0  4514 10534 0.429
mydf$race <- c("A","B","H","I","U","W")

ggplot(data=mydf, aes(x=race, y=ratio, fill=race)) + geom_bar(stat="identity")

What are your observations? I found that for most races, the ratio of female to male arrests is about 0.2, i.e., about 1 female for every 5 male arrests. I also saw that the female to male ratio for the white population is about 0.43, which is much higher than for the others.

Krishna Pai

What question are you asking?: Do police officers go out of their way to arrest more black people than white people?

What is the code you use to answer it?:

library(ggplot2)
# added by HCB to not dirty global environment
kpai=function()
  {
    arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
    tmp=arrest_tab$sex
    arrest_tab$sex=arrest_tab$race
    arrest_tab$race=tmp
    police_tab=read.csv("Police_Stations.csv", stringsAsFactors=FALSE)
    tmp=gsub("\\)","",gsub("\\(","",arrest_tab$Location))
    tmp=strsplit(tmp,split=",")
    arrest_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
    arrest_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))
    tmp=gsub("\\)","",gsub("\\(","",police_tab$Location))
    tmp=strsplit(tmp,split=",")
    police_tab$lon=as.numeric(sapply(tmp,function(x) x[2])) 
    police_tab$lat=as.numeric(sapply(tmp,function(x) x[1])) 
    plt=ggplot()
    plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='B',],aes(x=lon,y=lat),color="black",alpha=.1)
    plt=plt+geom_point(data=arrest_tab[arrest_tab$race=='W',],aes(x=lon,y=lat),color="white",alpha=.1)
    plt=plt+geom_point(data=police_tab,aes(x=lon,y=lat),color="red")
    print(plt)
  }
kpai()
## Warning: Removed 33454 rows containing missing values (geom_point).
## Warning: Removed 6251 rows containing missing values (geom_point).

What did you observe? I was surprised to find that most of the arrests made some distance away from the cluster of police stations were for white people. It would be interesting to investigate what kinds of crimes might have been committed so far away from the stations, and why white people stand out as being arrested at that distance, especially in the east.

2208Jay

Group Members
  • Praneet Puppala
  • Jason Rubin
  • Michael Younkin
  • Haoyue (Jay) Zhang
What question are you asking?:

Is there a correlation between arrest location and the locations of vacant buildings?

What is the code you use to answer it?:
# Read in the larger arrests dataset
recentArrests <- read.csv("full_BPD_Arrests.csv", header=TRUE)

# Convert the ArrestDate column
recentArrests$ArrestDate <- as.Date(recentArrests$ArrestDate, "%m/%d/%Y")

# Remove old arrests and arrests without reported locations
recentArrests <- subset(recentArrests, recentArrests$ArrestDate>="2015-01-01")
recentArrests <- subset(recentArrests, recentArrests$Location.1 != "")

# Read in the vacant buildings dataset
vacantBldgs = read.csv("Vacant_Buildings.csv", stringsAsFactors=FALSE,
header=TRUE)

# Remove rows without a location
vacantBldgs <- subset(vacantBldgs, vacantBldgs$Location.1 != "")

# Extract lat and lon into numeric columns
tmp = vacantBldgs$Location.1
tmp=gsub("\\)","",gsub("\\(","",vacantBldgs$Location.1))
tmp=strsplit(tmp, split=", ")
vacantBldgs$lon=as.numeric(sapply(tmp,function(x) x[2]))
vacantBldgs$lat=as.numeric(sapply(tmp,function(x) x[1]))

tmp = recentArrests$Location.1
tmp=gsub("\\)","",gsub("\\(","",recentArrests$Location.1))
tmp=strsplit(tmp, split=", ")
recentArrests$lon=as.numeric(sapply(tmp,function(x) x[2]))
recentArrests$lat=as.numeric(sapply(tmp,function(x) x[1]))

balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")
plt=ggplot()
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")
plt=plt+geom_point(data=vacantBldgs,aes(x=lon,y=lat),color="blue",alpha=.1)
plt=plt+geom_point(data=recentArrests,aes(x=lon,y=lat),color="red",alpha=.1)
print(plt)

What did you observe?

Based on the plot we produced, we observe that arrests are more concentrated around areas with vacant buildings.

Note: because the vacant-buildings data carries no date for when each building became vacant (the dataset is updated twice a month), we used arrest records from January 2015 onward to minimize the number of “extra” vacant buildings. We assume that not many new vacant buildings have been added since January.
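One way to move beyond eyeballing the overlay would be to bin both point sets into a coarse lat/lon grid and correlate the per-cell counts. A minimal sketch on synthetic points; the grid size (~0.01 degrees) and the coordinates are assumptions for illustration, not taken from the analysis above:

```r
set.seed(1)
# Synthetic points only: both kinds clustered around the same made-up center
arrests <- data.frame(lat=rnorm(500, 39.30, 0.02), lon=rnorm(500, -76.61, 0.02))
vacants <- data.frame(lat=rnorm(300, 39.30, 0.02), lon=rnorm(300, -76.61, 0.02))

# Count points per ~0.01-degree grid cell
cell <- function(df) paste(round(df$lat, 2), round(df$lon, 2))
a_counts <- table(cell(arrests))
v_counts <- table(cell(vacants))

# Align the cell sets and correlate the per-cell counts (0 where absent)
cells <- union(names(a_counts), names(v_counts))
a <- as.numeric(a_counts[cells]); a[is.na(a)] <- 0
v <- as.numeric(v_counts[cells]); v[is.na(v)] <- 0
cor(a, v)
```

A positive correlation here would back up the visual impression that arrests concentrate where vacant buildings do.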

Zach Jiroun, Jose Zamora, Des Chandhok

What question are you asking?:

What is the female to male ratio of different types of crimes on Valentine’s Day, 2011?

What is the code you use to answer it?:

library(plyr)
vdaytab <- table(incident=arrest_tab$incidentOffense,sex=arrest_tab$sex,date=arrest_tab$arrestDate=="02/14/2011")
vdaytab <- vdaytab[,,-1]  # drop the FALSE slice, keeping only Valentine's Day
vdaydf = as.data.frame.matrix(vdaytab)
vdaydf$ratio = vdaydf$F/vdaydf$M
vdaydf$ratio <- round(vdaydf$ratio, 3)
vdaydf <- subset(vdaydf, M > 0 | F > 0)
vdaydf[-1]
##                               F  M ratio
## 111-Protective Order          0  1 0.000
## 112-Traffic Related Incident  1  0   Inf
## 115-Trespassing               0  1 0.000
## 23-Unauthorized Use           2  0   Inf
## 24-Towed Vehicle              0  2 0.000
## 26-Recovered Vehicle          1  0   Inf
## 3K-Robb Res. (Ua)             0  2 0.000
## 3P-Robb Misc. (Ua)            0  1 0.000
## 4B-Agg. Asslt.- Cut           1  1 1.000
## 4C-Agg. Asslt.- Oth.          0  1 0.000
## 4E-Common Assault             1  4 0.250
## 54-Armed Person               0  1 0.000
## 5D-Burg. Oth. (Force)         0  1 0.000
## 6C-Larceny- Shoplifting       0  2 0.000
## 79-Other                      1  8 0.125
## 7A-Stolen Auto                1  0   Inf
## 87-Narcotics                  0 13 0.000
## 87O-Narcotics (Outside)       1  4 0.250
## 97-Search & Seizure           1  1 1.000
## Unknown Offense              11 52 0.212
vdaydf$offense <- c("111", "112", "115", "23", "24", "26", "3K", "3P", "4B", "4C", "4E", "54", "5D", "6C", "79", "7A", "87", "87O", "97", "Unknown")
vdaydf$offense_description <- c("(111) Protective Order", "(112) Traffic Related Incident", "(115) Trespassing", "(23) Unauthorized Use", "(24) Towed Vehicle", "(26) Recovered Vehicle", "(3K) Robb Res. (Ua)", "(3P) Robb Misc. (Ua)", "(4B) Agg. Asslt.- Cut", "(4C) Agg. Asslt.- Oth.", "(4E) Common Assault", "(54) Armed Person", "(5D) Burg. Oth. (Force)", "(6C) Larceny- Shoplifting", "(79) Other", "(7A) Stolen Auto", "(87) Narcotics", "(87O)Narcotics (Outside)", "(97) Search & Seizure", "Unknown Offense")
ggplot(data=vdaydf, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio") + theme(axis.text.x = element_text(angle=90))

# Removing 0 and Inf ratios
v <- subset(vdaydf, ratio > 0 & ratio < Inf)
v[-1]
##                          F  M ratio offense      offense_description
## 4B-Agg. Asslt.- Cut      1  1 1.000      4B    (4B) Agg. Asslt.- Cut
## 4E-Common Assault        1  4 0.250      4E      (4E) Common Assault
## 79-Other                 1  8 0.125      79               (79) Other
## 87O-Narcotics (Outside)  1  4 0.250     87O (87O)Narcotics (Outside)
## 97-Search & Seizure      1  1 1.000      97    (97) Search & Seizure
## Unknown Offense         11 52 0.212 Unknown          Unknown Offense
v$offense <- c("4B", "4E", "79", "87O", "97", "Unknown")
v$offense_description <- c("Agg. Asslt.- Cut", "Common Assault", "Other", "Narcotics (Outside)", "Search & Seizure", "Unknown Offense")
ggplot(data=v, aes(x=offense, y=ratio, fill=offense_description)) + geom_bar(stat="identity") + ggtitle("Valentine's Day 2011 F/M Crime Ratio")

What did you observe? We were surprised to find that there were significantly more male arrests than female arrests. In the categories where both males and females were arrested for the same crime, only about a quarter of the arrestees were female. We also thought it was interesting that there was only one female narcotics-related arrest. Whether the 17 narcotics-related male arrests had anything to do with Valentine’s Day remains to be seen.
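The `[,,-1]` slice in the code above drops the FALSE layer of a three-way table, leaving only the Valentine's Day counts. A toy illustration with made-up offenses (not the real data):

```r
# Toy data showing the three-way table and the [,,-1] slice
df <- data.frame(offense = c("A", "A", "B", "B", "B"),
                 sex     = c("F", "M", "M", "M", "F"),
                 onVDay  = c(TRUE, TRUE, TRUE, FALSE, FALSE))
t3 <- table(df$offense, df$sex, df$onVDay)
vday <- t3[, , -1]   # drop the FALSE slice; what's left is the TRUE counts
vday
#    F M
#  A 1 1
#  B 0 1
vday[, "F"] / vday[, "M"]   # per-offense F/M ratio: A = 1, B = 0
```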

Klar Kuo

What are you asking?

What types of crimes have increased and decreased in frequency?

What is the code you use to answer it?

library(ggplot2)
library(stringr)
require(data.table)
## Loading required package: data.table
# Filter the list so that it only includes types of offenses that have happened over 500 times
# (So that it has enough data to be meaningful)
arrest_type <- table(arrest_tab$incidentOffense)
arrest_type[arrest_type >= 500]
## 
##          115-Trespassing         24-Towed Vehicle      4B-Agg. Asslt.- Cut 
##                      871                     2994                     1195 
##     4C-Agg. Asslt.- Oth.     4D-Agg. Asslt.- Hand        4E-Common Assault 
##                     1556                      618                     6739 
##     55-Disorderly Person         55A-Prostitution    5A-Burg. Res. (Force) 
##                      923                     1398                      847 
##  6C-Larceny- Shoplifting 75-Destruct. Of Property                 79-Other 
##                     1849                      686                     3461 
##             87-Narcotics  87O-Narcotics (Outside)      97-Search & Seizure 
##                    24744                     6515                     3670 
##          Unknown Offense 
##                    38649
arrest_type_list <- names(arrest_type[arrest_type >= 500])
arrest_type_list
##  [1] "115-Trespassing"          "24-Towed Vehicle"        
##  [3] "4B-Agg. Asslt.- Cut"      "4C-Agg. Asslt.- Oth."    
##  [5] "4D-Agg. Asslt.- Hand"     "4E-Common Assault"       
##  [7] "55-Disorderly Person"     "55A-Prostitution"        
##  [9] "5A-Burg. Res. (Force)"    "6C-Larceny- Shoplifting" 
## [11] "75-Destruct. Of Property" "79-Other"                
## [13] "87-Narcotics"             "87O-Narcotics (Outside)" 
## [15] "97-Search & Seizure"      "Unknown Offense"
# All the arrests for those types that we singled out
arrest_dates <- arrest_tab[arrest_tab$incidentOffense %in% arrest_type_list ,]
# Set all the counts to one so we can sum them when we aggregate
arrest_table <- data.table(arrest_dates)
arrest_table <- arrest_table[, count := 1, by=arrestDate]

# Aggregate the data by how many times each type of incident occurred on each day
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$arrestDate), sum), c("incidentOffense","arrestDate","count"))

# Graph it
ggplot(data=arrest_table, aes(x=arrestDate, y=count, group=incidentOffense, colour=incidentOffense)) +
       geom_line() +
       geom_point()

# The last graph crams in too many data points, so aggregate by month and plot again
# Add a shortdate column (year/month) to aggregate on
arrest_table$shortdate <- strftime(as.Date(arrest_table$arrestDate, "%m/%d/%Y"), format="%y/%m")
arrest_table <- setNames(aggregate(arrest_table$count, list(arrest_table$incidentOffense, arrest_table$shortdate), sum), c("incidentOffense","arrestMonth","count"))

ggplot(data=arrest_table, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
       geom_line() +
       geom_point()

# Let's look at the graph without the two highest-frequency types, since they dwarf the others
low_arrests <- arrest_table[arrest_table$incidentOffense!="87-Narcotics",]
low_arrests <- low_arrests[low_arrests$incidentOffense!="Unknown Offense",]

ggplot(data=low_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
       geom_line() +
       geom_point()

# Take a closer look at the most common type of occurrence that's not "Unknown"
# (87-Narcotics)
high_arrests <- arrest_table[arrest_table$incidentOffense=="87-Narcotics",]
high_arrests
##     incidentOffense arrestMonth count
## 13     87-Narcotics       11/01   740
## 29     87-Narcotics       11/02   830
## 45     87-Narcotics       11/03   824
## 61     87-Narcotics       11/04   887
## 77     87-Narcotics       11/05  1140
## 93     87-Narcotics       11/06  1080
## 109    87-Narcotics       11/07  1034
## 125    87-Narcotics       11/08   991
## 141    87-Narcotics       11/09  1044
## 157    87-Narcotics       11/10  1052
## 173    87-Narcotics       11/11   822
## 189    87-Narcotics       11/12   983
## 205    87-Narcotics       12/01  1030
## 221    87-Narcotics       12/02   982
## 237    87-Narcotics       12/03  1245
## 253    87-Narcotics       12/04  1042
## 269    87-Narcotics       12/05  1165
## 285    87-Narcotics       12/06  1088
## 301    87-Narcotics       12/07  1117
## 317    87-Narcotics       12/08  1444
## 333    87-Narcotics       12/09  1245
## 349    87-Narcotics       12/10  1127
## 365    87-Narcotics       12/11   940
## 381    87-Narcotics       12/12   892
ggplot(data=high_arrests, aes(x=arrestMonth, y=count, group=incidentOffense, colour=incidentOffense)) +
       geom_line() +
       geom_point()

# Now let's split this into two lines, one for each year
high_arrests$year <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,1]
high_arrests$month <- str_split_fixed(high_arrests$arrestMonth, "/", 2)[,2]

ggplot(data=high_arrests, aes(x=month, y=count, group=year, colour=year)) +
       geom_line() +
       geom_point()

What did you observe?

Between the beginning of 2011 and the end of 2012, crime rates fluctuated quite a bit but ended up at around the same level. More specifically, the most common types of crime all seemed to increase in frequency over the period and then drop back to roughly where they started.

After looking at graphs for each year specifically, it looks like crime rises towards the middle of the year and then dies down towards winter. This could be due to the weather or perhaps the holiday scheduling of the police (or of criminal activity).

However, the rate of crime for this period does appear to have risen in the middle: in the year-separated graph for narcotics, the end of 2011 had clearly higher levels than the start of 2011, and the start of 2012 was higher than the end of 2012. This could be attributed to lower levels of employment during that time.
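The month bucketing above hinges on the `%y/%m` format string; a quick check on made-up dates in the same `%m/%d/%Y` form as the dataset:

```r
# strftime on parsed dates collapses each day to its year/month bucket
d <- c("01/15/2011", "01/31/2011", "02/14/2012")
shortdate <- strftime(as.Date(d, "%m/%d/%Y"), format="%y/%m")
shortdate
# "11/01" "11/01" "12/02"
table(shortdate)   # two January-2011 dates fall into the same bucket
```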

Albert Koy

What question are you asking?

Do people commit fewer crimes during the holidays?

What is the code you use to answer it?
(function() {
  source("calendarHeat_modified.R")

  arrest_tab=read.csv("BPD_Arrests.csv", stringsAsFactors=FALSE)
  arrest_counts_tab <- table(arrest_tab$arrestDate)
  arrest_counts_df <- data.frame(date=as.Date(names(arrest_counts_tab), "%m/%d/%Y"), num_arrests=as.vector(arrest_counts_tab))

  calendarHeat(arrest_counts_df$date, arrest_counts_df$num_arrests, color="g2r", varname="Number of Arrests Per Day")
})()
## Loading required package: lattice
## Loading required package: grid
## Loading required package: chron

What did you observe?

Looking at the heat map for 2011 and 2012, there is a very distinct reduction in number of arrests on Thanksgiving (the fourth Thursday of November). Moreover, on Christmas (the last Sunday of December in 2011 and the last Tuesday of December in 2012) and the days leading up to Christmas, we observe green squares as well. It’s hard to tell whether this reduction in arrests is due to fewer officers working on the holidays or due to fewer crimes being committed. In making this judgement, it would be useful to have data on the number of officers working each day.
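The dip could also be quantified numerically rather than read off the heat map. A sketch on synthetic daily counts; the Poisson rate and the size of the dips are invented for illustration, not taken from the arrest data:

```r
set.seed(2)
# Synthetic daily arrest counts for 2011
dates  <- seq(as.Date("2011-01-01"), as.Date("2011-12-31"), by="day")
counts <- rpois(length(dates), lambda=140)
counts[dates == as.Date("2011-11-24")] <- 60   # pretend Thanksgiving dip
counts[dates == as.Date("2011-12-25")] <- 55   # pretend Christmas dip

# Ratio of mean holiday count to mean non-holiday count; well below 1 here
holidays <- as.Date(c("2011-11-24", "2011-12-25"))
mean(counts[dates %in% holidays]) / mean(counts[!(dates %in% holidays)])
```

Run against the real per-day table, this would put a number on how deep the holiday reduction actually is.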

Nicholas Mosquera

What question are you asking?: How do the number of arrests change from 4th of July to Christmas for 2011 and 2012?

What is the code you use to answer it?:

a <- table(arrest_tab$arrest[arrest_tab$arrestDate=="07/04/2011"])
b <- table(arrest_tab$arrest[arrest_tab$arrestDate=="12/25/2011"])
c <- table(arrest_tab$arrest[arrest_tab$arrestDate=="07/04/2012"])
d <- table(arrest_tab$arrest[arrest_tab$arrestDate=="12/25/2012"])
num_of_arrests <- c(length(a),length(b),length(c),length(d))
barplot(num_of_arrests, main="Number of Arrests for 4th of July and Christmas", names.arg=c("07/04/2011","12/25/2011","07/04/2012","12/25/2012"))

What did you observe?: There are actually mixed results: for the 4th of July there were more arrests in 2012 (132) than in 2011 (113), while for Christmas there were more arrests in 2011 (53) than in 2012 (43). Both years saw many more arrests on the 4th of July than on Christmas.

James Collins, Beshad Talayminaei

What are you asking:

We wanted to answer this question: what percentage of crime per sex is theft related? In other words, if you are arrested as a male, what are the odds that your offense was theft related? Does this proportion change if you are female? To answer this we divide the number of theft-related arrests per sex by the total number of arrests per sex. For the purposes of this question we filter out unknown offenses and unknown sexes.

What is the code you used to answer it?

theft_regex <- regexpr("burg|stolen|robb|larceny", arrest_tab$incidentOffense, ignore.case=T, perl=T)
keep <- arrest_tab$sex != "" & arrest_tab$incidentOffense != "Unknown Offense"
totalTab <- table(arrest_tab$incidentOffense[keep], arrest_tab$sex[keep])
robbTab <- table(arrest_tab$incidentOffense[theft_regex > -1], arrest_tab$sex[theft_regex > -1])
robbSum <- colSums(robbTab)
totalSum <- colSums(totalTab)
robbData <- data.frame(Female = robbSum[1], Male = robbSum[2])
totalData <- data.frame(Female = totalSum[1], Male = totalSum[2])
theftFemaleProp <- (robbData$Female / totalData$Female) * 100
theftMaleProp <- (robbData$Male / totalData$Male) * 100
theftFemaleProp
## [1] 9.831437
theftMaleProp
## [1] 9.40193

What did you observe? We were surprised to see that females are slightly more likely to be arrested for a theft-related crime. This is mainly due to larceny arrests.

larceny_regex <- regexpr("larceny", arrest_tab$incidentOffense, ignore.case=T, perl=T)
larcenyTab <- table(arrest_tab$incidentOffense[larceny_regex > -1], arrest_tab$sex[larceny_regex > -1])
larcenySum <- colSums(larcenyTab)
totalSum <- colSums(totalTab)
larcenyData <- data.frame(Female = larcenySum[1], Male = larcenySum[2])
totalData <- data.frame(Female = totalSum[1], Male = totalSum[2])
larcFemaleProp <- (larcenyData$Female / totalData$Female) * 100
larcMaleProp <- (larcenyData$Male / totalData$Male) * 100
larcFemaleProp
## [1] 6.639856
larcMaleProp
## [1] 3.971139
slices <- c(theftFemaleProp, theftMaleProp)
lbls <- c("Female", "Male")
pie(slices, labels = lbls, main="Theft")

slices <- c(larcFemaleProp, larcMaleProp)
lbls <- c("Female", "Male")
pie(slices, labels = lbls, main="Larceny")
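The per-sex proportions computed above can also be read straight off a contingency table with `prop.table`. A toy illustration with synthetic counts (not the real tallies):

```r
# Synthetic counts; rows: theft vs other, columns: sex
m <- matrix(c(12, 88, 30, 270), nrow=2,
            dimnames=list(c("theft", "other"), c("F", "M")))
# margin=2 normalizes each column, giving P(offense type | sex)
p <- prop.table(m, margin=2)["theft", ]
p
#    F    M
# 0.12 0.10
```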

Cynthia Gan

What question are you asking?:

I am asking if there are differences in the distribution of arrests throughout the day depending on the age of the arrestee.

What is the code you use to answer it?:

# Bin the arrest data into age and time-of-day chunks
# (hours is assumed to be a numeric hour-of-day column derived from arrestTime)
ageGrid = seq(0, 90, 10)
ageFactor = cut(arrest_tab$age, breaks=ageGrid)
hourGrid = seq(0, 24, 3)
hourFactor = cut(arrest_tab$hours, breaks=hourGrid)
# Create a mosaic plot
mosaicplot(table(ageFactor, hourFactor), color=colorRampPalette(c('dark cyan','white','dark goldenrod'))(8), main="Number of Arrests by Age and Time of Day", xlab ="Age", ylab = "Time")

What did you observe?

The distribution of arrest times across age groups is actually quite similar. However, the young and the elderly are less likely to be arrested at the very beginning and end of the day. Also, people 20 years old and younger are active in the afternoon and evening rather than the morning. The drop in activity between 3am and 6am observed by Mihai is shared by all age groups.
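One subtlety of `cut()` worth noting here, since the arrest ages range from 0 upward:

```r
# cut() uses half-open intervals (lo, hi], so an age of exactly 0 falls
# outside (0,10] and becomes NA, silently dropping those records from the plot
x <- cut(c(0, 1, 10, 11, 89), breaks=seq(0, 90, 10))
x
# <NA> (0,10] (0,10] (10,20] (80,90]
```

Passing `include.lowest=TRUE` would keep the age-0 arrests in the first bin.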

Joe Downs, Michael Kleyman, Anumeet Nepaul

What question are you asking?:

We were curious about how temperature relates to crime. Does crime count correlate with temperature? We used average temperature values for Baltimore from wunderground and normalized our counts for each temperature by the number of days that experienced that temperature. It seems that there should be more crimes in warmer weather because both criminals and victims should be more likely to spend time outside when it is warmer.

What is the code you use to answer it?:

    # load weather data from csv file
    baltimore_weather <- read.csv("baltimore_weather.csv", stringsAsFactors=FALSE)

    # store an abbreviated copy of the weather data and format the date column
    tempAbr <- baltimore_weather[,c("EST","Mean.TemperatureF")]
    tempAbr$EST <- as.Date(baltimore_weather$EST,"%Y-%m-%d")

    # copy arrest data and add formatting for date
    Date_Arrests <- arrest_tab
    Date_Arrests$arrestDate <- as.Date(Date_Arrests$arrestDate,"%m/%d/%Y")

    # join weather and crime tables to match days together
    tempMerge<-merge(x=Date_Arrests,y=tempAbr, by.x ="arrestDate", by.y="EST" )
    tableCrime <- table(tempMerge$Mean.TemperatureF)
    tableTemp <- table(baltimore_weather$Mean.TemperatureF)

    # plot crime count versus temperature, normalized by day count for each temperature
    plot(tableCrime/tableTemp, xlab="Temperature (F)", ylab="Normalized Crime Count")

    divTable = tableCrime/tableTemp
    divtmatrix = as.data.frame(divTable)
    # cast factors as numeric values for correlation
    divtmatrix[,1] <- as.numeric(as.character(divtmatrix[,1]))
    divtmatrix[,2] <- as.numeric(divtmatrix[,2])
    cor(divtmatrix[,1],divtmatrix[,2])
## [1] 0.3673159

What did you observe?:

While we suspected that crime count would correlate with temperature, we found that the correlation coefficient was only 0.3673, suggesting that crime is not influenced linearly by the temperature. There also seems to be no clear non-linear dependence of crime on temperature for our data. Compared to this (http://crime.static-eric.com/) analysis of Chicago, temperature appears to relate to crime in Baltimore to a much smaller degree.
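A rank-based correlation is one way to check for a monotone (not necessarily linear) relationship before concluding there is no dependence. A sketch on toy numbers, which are illustrative only and not the Baltimore values:

```r
# A strictly increasing but accelerating relationship
temp  <- c(30, 40, 50, 60, 70, 80)
crime <- c(100, 105, 120, 150, 200, 280)
cor(temp, crime)                     # Pearson understates the association
cor(temp, crime, method="spearman")  # ranks agree perfectly: 1
```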

Alyster Alcudia

What question are you asking?:

According to a 2009 study by the Department of Justice, children exposed to violence are more likely to abuse drugs and alcohol; suffer from depression, anxiety, and post-traumatic stress disorder; fail or have difficulty in school; and become delinquent and engage in criminal behavior.*

Considering how many youths are admitted into the Baltimore Juvenile Justice Center per year, I will be asking the following:

How close to public elementary schools do arrests for violent crimes occur?

*Finkelhor, D., Turner, H., Ormrod, R., Hamby, S., and Kracke, K. 2009. Children’s Exposure to Violence: A Comprehensive National Survey. Bulletin. Washington, DC: U.S. Department of Justice, Office of Justice Programs, Office of Juvenile Justice and Delinquency Prevention.

What is the code you use to answer it?:

Let’s load the school location data from https://data.baltimorecity.gov/Geographic/Baltimore-City-Public-School-System-Elementary-Sch/6jcd-xgn7.

school_tab=read.csv("bcpss_es_boundary.csv", stringsAsFactors=FALSE)

Now let’s filter the arrest data to keep only violent crimes, based on the filter from https://data.baltimorecity.gov/Public-Safety/Violent-Crime/s97b-tych.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:data.table':
## 
##     between, last
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
violent_str<-"Murder|Rape|Assault|Robbery"

# look through incidentOffense and chargeDescription
violent_tab<-
  filter(arrest_tab, 
  (grepl(violent_str, incidentOffense) | 
  grepl(violent_str, chargeDescription)))

We still need to extract longitude and latitude from violent_tab’s Location.1 column.

tmp=gsub("\\)","",gsub("\\(","",violent_tab$Location.1))
tmp=strsplit(tmp,split=",")
violent_tab$lon=as.numeric(sapply(tmp,function(x) x[2]))
violent_tab$lat=as.numeric(sapply(tmp,function(x) x[1]))

We can now plot locations of violent crimes arrests in Baltimore with elementary schools.

library(maps)
library(ggplot2)

balto_map = subset(map_data("county", region="maryland"),subregion=="baltimore city")

plt=ggplot()

# plot shape of Baltimore City
plt=plt+geom_polygon(data=balto_map,aes(x=long,y=lat),color="white",fill="gray40")

# plot violent crimes arrests in red
plt=plt+geom_point(data=violent_tab,aes(x=lon,y=lat),color="red",alpha=.1)

# plot elementary schools in blue
plt=plt+geom_point(data=school_tab,aes(x=x,y=y),color="blue")

print(plt)
## Warning: Removed 8800 rows containing missing values (geom_point).

What did you observe?:

Most schools are not far from the locations of violent-crime arrests, the notable exceptions being the four schools in North Baltimore. The schools toward the middle of the city are more exposed to violent-crime arrests. These results are largely unsurprising, however, since both the locations of public elementary schools and of violent-crime arrests might simply track population density.

Deeper analyses would have to be made to correlate exposure to violent crimes and school-to-prison pipelines.
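The question asks *how close* arrests occur to schools, which the overlay plot only suggests visually. A hypothetical sketch of a nearest-school distance computation, using made-up coordinates and an equirectangular approximation (~111.2 km per degree of latitude), which is adequate at city scale:

```r
# Distance (km) from each point in pts to its nearest point in refs
nearest_km <- function(pts, refs) {
  sapply(seq_len(nrow(pts)), function(i) {
    dlat <- (refs$lat - pts$lat[i]) * 111.2
    dlon <- (refs$lon - pts$lon[i]) * 111.2 * cos(pts$lat[i] * pi / 180)
    min(sqrt(dlat^2 + dlon^2))
  })
}
# Made-up school and arrest coordinates, roughly in Baltimore's lat/lon range
schools <- data.frame(lat=c(39.30, 39.35), lon=c(-76.61, -76.58))
arrests <- data.frame(lat=c(39.31, 39.36), lon=c(-76.60, -76.58))
d <- nearest_km(arrests, schools)
summary(d)
```

Run with the real school_tab and violent_tab coordinates, `summary(d)` would give an actual distribution of arrest-to-nearest-school distances.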

Ian Persons

What question are you asking?: What day of the week has the most crime?

What is the code you use to answer it?:

library(ggplot2)
ipersons=function()
  {
    arrest_tab = read.csv(file="BPD_Arrests.csv", header=TRUE)
    date <- strptime(arrest_tab$arrestDate, "%m/%d/%Y")
    arrest_tab$dow <- as.numeric(format(date, "%w"))
    
    dow_df <- as.data.frame(table(arrest_tab$dow))
    names(dow_df) <- c("Day", "Count")
    
    g <- ggplot(data=dow_df, aes(x=Day,y=Count, fill=Day)) + geom_bar(stat="identity")
    g <- g + labs(title = "Arrests During the Week", x="Day of the Week",y="Number of Arrests")
    g
  }
ipersons()

What did you observe? I observed that arrests peak in the middle of the week, with the weekend having the fewest. It would be interesting to break down the types of crimes committed across the week to see whether all crimes follow this pattern.
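Since the Day axis in the plot is labeled 0 through 6, it helps to recall what `%w` encodes; a quick check against dates whose weekdays are known:

```r
# %w gives the day of week as 0-6, with 0 = Sunday
format(as.Date(c("2011-01-02", "2011-01-03", "2011-01-08")), "%w")
# "0" "1" "6"   (Sunday, Monday, Saturday)
```

So the bars at 0 and 6 are the weekend days showing the fewest arrests.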